Stems, Lemmas, and Part of Speech (PoS) Tags

Getting packages, and allowing them to automatically download the files they need.

A quick note before we start: some packages need quite large files to work well. When you run the code below, it might take a while (say, a minute). That’s because textstem will be downloading some files it needs to function properly.

About halfway down the notebook there will be another moment like that, around “We start by downloading a language model”. There the download will take more like 5 minutes. Oh well… at least you only need to do this once ever!

WHEN YOU RUN THE CODE BELOW YOU MIGHT BE ASKED IF YOU’D LIKE TO RELOAD RSTUDIO: ANSWER ‘NO’

# for all packages we need:
#install.packages("pacman")
pacman::p_load(dplyr, stringr, udpipe, lattice, tidytext, readr, SnowballC, textstem)

Let’s get right to work. We’ll start by loading the data:

#Load the csv to a dataframe

file_path <- './data/CORONA_TWEETS.csv'
Corona_NLP_DF <- read.csv(file_path)

#Converting into the tibble dataframe
mydata_TB <- as_tibble(Corona_NLP_DF) 

Stemming

From the video you already know what a stem is, and why it is important. Now let’s look at the code that produces stems.

We will use the SnowballC package to carry out stemming. We will tokenise our text column into words and apply the stemmer wordStem. Have a look at the tibble it produces and compare the word and stem columns.

You can see more options using help(wordStem). For example, you can change the language or call different stemmers.

library(SnowballC)

mydata_TB %>%
  unnest_tokens(output = word, input= text) %>%
  mutate(stem = wordStem(word)) 
# have a look at the first few pages of the result below (especially the word and stem columns). Are any of them surprising? Why do you think they are this way? Would you have stemmed them differently?
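As a quick illustration of those options (a sketch only; the non-English words below are just example inputs), you can list the Snowball algorithms that SnowballC ships with and switch between them via the language argument:

```r
library(SnowballC)

# List the stemming algorithms available in this build of SnowballC
getStemLanguages()

# The default is the classic Porter stemmer; language switches algorithms
wordStem(c("running", "easily", "cats"), language = "english")

# The same function handles other languages, e.g. French
wordStem(c("mangeait", "chanteuse"), language = "french")
```

Try a few languages from getStemLanguages() and compare how aggressively each stemmer trims word endings.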

And here’s an example of applying stems to a simple string. Notice that since stemming applies to individual words, we have to split the string into words, get the stems, then stitch the sentence back together. It’s a bit crude, but it shows what we can do with stems.

my_text_about_cats <- c("My cat is tired today. She is sleeping on her mat in the sun. She is dreaming about running and eating now. She usually wakes when she is hungry.")

stems <- my_text_about_cats %>%
  strsplit("\\s+") %>%
  unlist() %>%
  wordStem() %>%
  paste(collapse = " ")

stems
## [1] "My cat i tire today. She i sleep on her mat in the sun. She i dream about run and eat now. She usual wake when she i hungry."

Wouldn’t it be nice if the library could do the splitting into words for us? Of course! This is R, so obviously there is a function for that. In fact there are two functions we can use here: stem_words() and stem_strings().

Notice an important difference between those two methods:

  • stem_words() expects a vector of words, and will find the stem of each item in that vector. You usually have to do the work of preparing the data, but you also have more control over the result.

c("She", "is", "dreaming", "about", "running", "and", "eating", "now") %>% 
  stem_words()
## [1] "She"   "i"     "dream" "about" "run"   "and"   "eat"   "now"
  • stem_strings() is more forgiving: it expects whole strings (e.g. sentences) and will stem each word in each of those sentences. But the result is a sentence, which gives you less fine-grained control.

my_text_about_cats %>% 
  stem_strings()
## [1] "My cat i tire todai. She i sleep on her mat in the sun. She i dream about run and eat now. She usual wake when she i hungri."

Lemmatisation

From the video you also know what lemmas are. Now let’s see them in code.

Here we use the textstem package to produce lemmas (this package will also do stemming). Just like in the examples above, there are two functions we’ll use: one expects a vector of words, and one a vector of more complex strings (which it will split into words by itself).

Below are some interesting examples:

# btw. depending on your R environment, you might need a syuzhet package
if (!require("syuzhet")){
  install.packages('syuzhet')
}
## Loading required package: syuzhet
## Warning: package 'syuzhet' was built under R version 4.4.3
library(syuzhet)

library(textstem)
vector <- c("run", "ran", "running", "walked", "walks", "walking")
lemmatize_words(vector) # takes collection of words!
## [1] "run"  "run"  "run"  "walk" "walk" "walk"

Luckily for us, lemmatize_strings can be applied to whole sentences (and will get the lemma of each individual word by itself).

my_text_about_cats <- c("My cat is tired today. She is sleeping on her mat in the sun. She is dreaming about running and eating now. She usually wakes when she is hungry.")

lemmatize_strings(my_text_about_cats) # takes whole sentences (and can take many)
## [1] "My cat be tire today. She be sleep on her mat in the sun. She be dream about run and eat now. She usually wake when she be hungry."

Question to ponder: what would happen if you put a whole sentence into lemmatize_words()? And why?

ACTIVITY: Stems vs. Lemmas

Try using this package to do stemming and compare to the results from above. Copy bits of code from above that take a whole string and reduce it to stems, and to lemmas. Compare the results. What do you see? Does it make sense in the context of the video you’ve seen?

my_text_about_cats <- c("My cat is tired today. She is sleeping on her mat in the sun. She is dreaming about running and eating now. She usually wakes when she is hungry.")

# bring pieces of code from above here. One that will turn in into lemmas, one which will turn it into stems. What differences in output do you see?  

#  your code here

Hint: Example code

Come back to this later: to understand this better, have a look at the help pages for these functions. You will see that you can supply different dictionaries and lexicons to support stemming and lemmatisation, and these change the results you get.
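As a hedged sketch of what that looks like: lemmatize_words() takes a dictionary argument (by default lexicon::hash_lemmas, a token/lemma lookup table). The tiny hand-made dictionary below is purely illustrative, not a real lexicon:

```r
library(textstem)

# A tiny custom lemma dictionary: a data frame with token and lemma
# columns, mirroring the shape of lexicon::hash_lemmas
my_dictionary <- data.frame(
  token = c("corona", "coronavirus"),
  lemma = c("covid", "covid")
)

# Words found in the dictionary are mapped to their lemma;
# anything not in the dictionary is left as-is
lemmatize_words(c("corona", "coronavirus", "running"),
                dictionary = my_dictionary)
```

Swapping in a different dictionary (or building one with a different lexicon) is how you change which lemmas you get back.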

Part of Speech Tagging

There are many different POS tagging tools available in R. We will use UDPipe but you may also want to look at the R implementation of openNLP. UDPipe also does lemmatisation and dependency parsing.

To use UDPipe you need to use a language model.

We start by downloading a language model in English. (UDPipe will also work in other languages.)

We then create data from our Corona tweets that we want to POS tag.

Next we are going to “annotate” that data with POS tags.

UDPipe produces two types of POS tags, upos and xpos: upos is universal part-of-speech tagging, while xpos is language-specific.

Brace yourself! A big, long download ahead (about 5 minutes).

#We start by downloading a language model

model_eng_ewt   <- udpipe_download_model(language = "english-ewt")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to F:/Paul Smith/04-r_testing/short-NLP-1-HSC-notes/english-ewt-ud-2.5-191206.udpipe
##  - This model has been trained on version 2.5 of data from https://universaldependencies.org
##  - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
##  - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
##  - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
## Downloading finished, model stored at 'F:/Paul Smith/04-r_testing/short-NLP-1-HSC-notes/english-ewt-ud-2.5-191206.udpipe'
model_eng_ewt_path <- model_eng_ewt$file_model

#To load our downloaded model, use the udpipe_load_model() function:

model_eng_ewt_loaded <- udpipe_load_model(file = model_eng_ewt_path)

#Creating a text variable 
text <- Corona_NLP_DF$text %>% str_squish()

#Annotate data - this may take a moment
text_annotated <- udpipe_annotate(model_eng_ewt_loaded, x = text) %>%
      as.data.frame() 
    
# Look at text_annotated - the extra columns upos and xpos give you the POS tags
text_annotated
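To see the difference between the two tag sets side by side (a small sketch, assuming the text_annotated data frame from the chunk above), pull out just the token and tag columns:

```r
# Compare the universal (upos) and language-specific (xpos) tags
text_annotated %>%
  select(token, upos, xpos) %>%
  head(10)
```

Notice how several xpos tags (e.g. NN, NNS, NNP) can collapse into a single upos tag (NOUN).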

We can now access the text_annotated data frame to graph the frequencies of POS tags and look at the highest occurring types. We do this for nouns and adjectives.

# Now you can display the most popular parts of speech:
txt_freq(text_annotated$xpos)

Pre-Activity: Visualising Parts of Speech Tagging

First read the code below. What does it do? Once you understand it enough there is a task waiting for you below, where you will have to change it:

#Look at the frequency of the POS tags
freq <- txt_freq(text_annotated$xpos)
print(freq)
##      key  freq     freq_pct
## 1     NN 28259 19.175544548
## 2     IN 14504  9.841894551
## 3     DT  9000  6.107077424
## 4    NNS  8766  5.948293411
## 5    NNP  8202  5.565583226
## 6     JJ  7792  5.287371921
## 7      .  7171  4.865983579
## 8     VB  6886  4.672592794
## 9     RB  6565  4.454773699
## 10   PRP  5713  3.876637036
## 11     ,  5095  3.457284386
## 12   VBP  4284  2.906968854
## 13    CC  4209  2.856076542
## 14   VBG  3780  2.564972518
## 15    CD  3418  2.319332293
## 16   VBZ  2955  2.005157088
## 17    TO  2493  1.691660446
## 18   VBN  2256  1.530840741
## 19  PRP$  1941  1.317093031
## 20   VBD  1921  1.303521748
## 21    MD  1773  1.203094253
## 22   ADD  1308  0.887561919
## 23    RP  1018  0.690778313
## 24   WRB   699  0.474316347
## 25  HYPH   697  0.472959218
## 26    UH   643  0.436316754
## 27    WP   640  0.434281061
## 28     :   624  0.423424035
## 29    GW   395  0.268032843
## 30   JJR   387  0.262604329
## 31 -RRB-   345  0.234104635
## 32  NNPS   343  0.232747506
## 33   NFP   327  0.221890480
## 34   WDT   318  0.215783402
## 35   JJS   313  0.212390582
## 36 -LRB-   282  0.191355093
## 37    EX   225  0.152676936
## 38   PDT   216  0.146569858
## 39    ``   209  0.141819909
## 40    ''   200  0.135712832
## 41    LS   194  0.131641447
## 42   RBR   185  0.125534369
## 43   POS   179  0.121462984
## 44     $   178  0.120784420
## 45   SYM   164  0.111284522
## 46    FW   124  0.084141956
## 47   RBS    95  0.064463595
## 48   AFX    76  0.051570876
## 49   WP$     3  0.002035692

and now let’s visualise that data:

#Create a barchart to look at the frequencies of the upos type of POS tags
freq.distribution.upos <-
  txt_freq(text_annotated$upos)

freq.distribution.upos$key <-
  factor(freq.distribution.upos$key,
         levels = rev(freq.distribution.upos$key))

barchart(
  key ~ freq,
  data = freq.distribution.upos,
  col = "dodgerblue",
  main = "UPOS frequencies",
  xlab = "Freq"
)

## NOUNS
nouns <- subset(text_annotated, upos %in% c("NOUN")) 
nouns <- txt_freq(nouns$token)

nouns$key <- factor(nouns$key, levels =
                      rev(nouns$key))

barchart(key ~ freq, data = head(nouns, 20), 
         col ="cadetblue", 
         main = "Most occurring nouns", 
         xlab = "Freq")

## ADJECTIVES
adj <- subset(text_annotated, upos %in% c("ADJ"))

adj <- txt_freq(adj$token)

adj$key <- factor(adj$key, levels = rev(adj$key))

barchart(key ~ freq, data = head(adj, 20), 
         col = "purple",
         main = "Most occurring adjectives", 
         xlab ="Freq")

Activity: Playing with Visualising Parts of Speech Tagging

Now it’s your turn. Copy-paste the code above and make changes to it: add another type of POS tag - copy the block for NOUN or ADJECTIVE and replace the POS tag type with another one.

# code can come here

Reflection:

Now is a good moment to write down your self-reflection: think of 3 STARS (things that you learned in this badge) and 1 WISH (a thing you wish you understood better). You might also think of what you would do to fulfil your wish. Write them down.

Conclusion:

Now you’ve seen lemmas, stems, and POS tags in action. They open up a whole new world of practical NLP.